About the project

Introduction to Open Data Science is a course organized by the Doctoral School in Humanities and Social Sciences (HYMY), University of Helsinki. As mentioned on the course page, “The name of this course refers to THREE BIG THEMES: 1) Open Data, 2) Open Science, and 3) Data Science”. The learning platform for this course is the MOOC (Massive Open Online Course) platform of the University of Helsinki: http://mooc.helsinki.fi. More information about the course can be found at this link: https://courses.helsinki.fi/en/hymy005/120776718.


My GitHub repository for the IODS-project

GitHub


Tools and Methods for open and reproducible research

In the first week of the course, I went through the general instructions and familiarized myself with the learning tools, the course content, and the schedule. Although I had previous experience with R and RStudio, I completed all the exercises and instructions in DataCamp: R Short and Sweet to refresh my R; my statement of accomplishment is published here. The other exercise I did in this first week is RStudio Exercise 1. I already had a GitHub account, so to follow the exercise instructions I forked the IODS-project repository from Tuomo Nieminen’s GitHub. After forking, I cloned the IODS-project repository to my local machine (my computer) using the command git clone. Once the repository was on my computer, I used RStudio to edit the existing R Markdown files in it (chapter1.Rmd, chapter2.Rmd and index.Rmd) and also added a new R Markdown file saved as README.Rmd. Next, I committed the local changes with the git command git commit -m and finally pushed them to GitHub with git push. I also changed the theme of my course diary web page to the Merlot theme. In these exercises, I refreshed my Git and GitHub knowledge and familiarized myself with how I will use GitHub in this course to share and publish my learning diary.


Regression and Model validation

Data wrangling

In this section, the data set given in this link has been preprocessed for downstream analysis. After creating a folder named ‘data’ in my IODS-project repository, I created a new R script named ‘create_learning2014’ in RStudio, wrote all the steps of the data wrangling process into it, and saved it in the ‘data’ folder. All the data wrangling steps can be found here and/or in the code shown below.

Analysis

The data for this section were extracted from a survey on teaching and learning approaches conducted by Kimmo Vehkalahti. The research project was made possible by Teachers’ Academy funding (2013–2015). The data were collected between 3.12.2014 and 10.1.2015. The survey includes 183 respondents, and the questionnaire comprises 60 variables/questions in total.

students2014=read.table("data/learning2014.txt", sep="\t", header=TRUE)
dim(students2014)
## [1] 166   7
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
summary(students2014)
##  gender       Age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

After preprocessing the raw data, the final data set has 7 columns (gender, age, attitude, deep, stra, surf and points) and 166 individuals, as shown in the output above.

plot_students2014=ggpairs(students2014, mapping = aes(col = gender,alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

plot_students2014

The plot above shows that there are 110 female and 56 male respondents in the survey. It also shows how the variables (Points, Age, attitude, deep, stra and surf) are correlated. Based on the absolute correlation values, attitude and points are the most highly correlated pair, while deep learning and points are the least correlated.

The explanatory variables were chosen based on their (absolute) correlation with points; the top three are attitude = 0.437, stra = 0.146 and surf = -0.144. Using these variables, the model is built with the R function lm().
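As a hedged sketch (not the course's own code), the ranking of candidate predictors by absolute correlation with Points could be computed like this; the toy data frame below is a stand-in for students2014.

```r
# Rank candidate predictors by |correlation| with the outcome.
# Toy data frame standing in for students2014 (values are illustrative).
toy <- data.frame(
  Points   = c(25, 12, 24, 10, 22, 21, 31, 24),
  attitude = c(3.7, 3.1, 2.5, 3.5, 3.7, 3.8, 2.9, 3.8),
  stra     = c(3.38, 2.75, 3.62, 3.12, 3.62, 2.50, 3.10, 2.90),
  surf     = c(2.58, 3.17, 2.25, 2.25, 2.83, 3.00, 2.60, 2.70)
)
cors <- sapply(toy[, c("attitude", "stra", "surf")],
               function(x) cor(toy$Points, x))
sort(abs(cors), decreasing = TRUE)  # predictors ranked by absolute correlation
```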

my_model1 <- lm(Points ~ attitude + stra + surf, data = students2014)

summary(my_model1)
## 
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The significant parameters in the summary table above are the intercept (0.00322) and attitude (1.93e-08). Hence stra and surf are not included in the model below: using the lm() function again, a linear regression model is built between points and attitude. The model after removing the insignificant variables is summarized below. With regard to the adjusted R-squared, the value changed from 0.1927 (older model) to 0.1856 (newer model). However, the F-statistic (from 14.13 to 38.61) and its p-value (from 3.156e-08 to 4.119e-09) improved. Thus, the R-squared value alone may not determine the quality of the model, and the low R-squared might be due to outliers in the data.

my_model2 <- lm(Points ~ attitude , data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
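A more formal way to justify dropping stra and surf is an F-test between the nested models via anova(). This is a hedged sketch rather than the course's own code; synthetic data stand in for students2014 (only attitude truly drives Points here).

```r
# Compare nested linear models with an F-test; synthetic stand-in data.
set.seed(1)
d <- data.frame(attitude = runif(50, 1, 5),
                stra     = runif(50, 1, 5),
                surf     = runif(50, 1, 5))
d$Points <- 11.6 + 3.5 * d$attitude + rnorm(50, sd = 5)

full    <- lm(Points ~ attitude + stra + surf, data = d)
reduced <- lm(Points ~ attitude, data = d)
anova(reduced, full)  # a large p-value means the extra terms add little
```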

Model Validation

par(mfrow = c(2,2))

plot(my_model2, which=c(1,2,5))

The three diagnostic model validation plots are shown above. The assumptions are:

  • The errors are normally distributed
  • The errors are not correlated
  • The errors have constant variance
  • The size of a given error does not depend on the explanatory variables

Based on the above plots, we can conclude that the errors are approximately normally distributed (clearly seen in the Q-Q plot). Similarly, the residuals versus fitted plot shows that the errors do not depend on the attitude variable. Moreover, in the residuals versus leverage plot we can see that even the two points towards the right have only minor influence. The model handles the outliers nicely, so the assumptions are more or less valid.


Logistic regression

Data wrangling

library(GGally)
library(ggplot2)
library(ggpubr)
## Loading required package: magrittr
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)
library(gmodels)
library(boot)

In this section, data wrangling has been done for two data sets retrieved from the UCI Machine Learning Repository. The R script used for the data wrangling can be found here.

Analysis

The data sets used in this analysis were retrieved from the UCI Machine Learning Repository. They describe student achievement in secondary education at two Portuguese schools. The data were collected using school reports and questionnaires that include attributes such as student grades and demographic, social and school-related features. Two data sets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por) (Cortez et al. 2008). For this analysis, the original data sets (mat and por) were modified and joined, and only students who answered the questionnaire in both classes were kept. The final data set used in this analysis includes a total of 382 observations and 35 attributes.

modified_data= read.table("data/modified_data.txt", sep="\t", header=TRUE)

colnames(modified_data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

In order to study the relationship between high/low alcohol consumption and the other variables, I chose four variables (“absences”, “sex”, “goout” and “studytime”). My hypotheses about how each variable is related to alcohol consumption are shown below:

  • studytime: Students who dedicate a large amount of time to their studies do not have time to go out drinking alcohol (studytime and high/low alcohol consumption are negatively correlated)

  • sex: Male students have higher alcohol consumption than female students (being male and high/low alcohol consumption are positively correlated)

  • goout: Students who go out with friends frequently consume more alcohol than students who do not go out (goout and high/low alcohol consumption are positively correlated)

  • absences: Students who consume more alcohol miss classes for various reasons (e.g. hangover, attending a party during class time) (absences and high/low alcohol consumption are positively correlated)
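A quick, base-R way to eyeball hypotheses like these is to compare group means between the high_use groups. This is a minimal sketch with a toy data frame standing in for the joined student data.

```r
# Compare mean goout and studytime between high_use groups.
# Toy data frame standing in for the joined alcohol-consumption data.
toy <- data.frame(
  sex       = c("F", "F", "M", "M", "M", "F", "M", "F"),
  goout     = c(2, 3, 4, 5, 5, 1, 4, 2),
  studytime = c(3, 4, 1, 2, 1, 4, 2, 3),
  high_use  = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)
agg <- aggregate(cbind(goout, studytime) ~ high_use, data = toy, FUN = mean)
agg  # higher mean goout / lower mean studytime in the high_use group
```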


The distributions of my chosen variables and their relationships with alcohol consumption


A bar plot demonstrating the distributions of my chosen variables

my_variables= c("absences", "sex", "goout", "studytime", "high_use")

my_variables_data <- select(modified_data, one_of(my_variables))

colnames(my_variables_data)
## [1] "absences"  "sex"       "goout"     "studytime" "high_use"
gather(my_variables_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped


A box plot demonstrating my chosen variables and their relationships with alcohol consumption

g1 <- ggplot(modified_data, aes(x = high_use, y = absences,col=sex))

p1=g1 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

g2 <- ggplot(modified_data, aes(x = high_use, y = goout, col=sex))

p2=g2 + geom_boxplot() + ylab("going out with friends")  + ggtitle("Student who going out with friends by alcohol consumption and sex") 

g3 <- ggplot(modified_data, aes(x = high_use, y = studytime, col=sex))

p3=g3 + geom_boxplot() + ylab("studytime - weekly study time")  + ggtitle("Student who dedicate time to study by alcohol consumption and sex") 

ggarrange(p1, p2, p3 , labels = c("A", "B","C"), ncol = 2, nrow = 2)

#sex_high_use <- CrossTable(my_variables_data$sex, my_variables_data$high_use)
#goout_high_use <- CrossTable(my_variables_data$goout, my_variables_data$high_use)
#absences_high_use <- CrossTable(my_variables_data$absences, my_variables_data$high_use)
#studytime_high_use <- CrossTable(my_variables_data$studytime, my_variables_data$high_use)


To statistically explore the relationship between my chosen variables and the high/low alcohol consumption variable, a logistic regression model was built using the R function glm().

m <- glm(high_use ~  absences + sex +  goout + studytime , data = modified_data, family = "binomial")


In order to interpret the fitted model in detail, the summary, coefficients and confidence intervals of the fitted model are calculated as shown below.

summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + goout + studytime, 
##     family = "binomial", data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9103  -0.7892  -0.5021   0.7574   2.6021  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.20483    0.59429  -5.393 6.94e-08 ***
## absences     0.07811    0.02244   3.481 0.000499 ***
## sexM         0.78173    0.26555   2.944 0.003242 ** 
## goout        0.72677    0.11994   6.059 1.37e-09 ***
## studytime   -0.42116    0.17058  -2.469 0.013551 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 381.31  on 377  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    absences        sexM       goout   studytime 
## -3.20483157  0.07810746  0.78173290  0.72676824 -0.42116184
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## absences    1.08123884 1.03577383 1.1324143
## sexM        2.18525582 1.30370562 3.7010763
## goout       2.06838526 1.64470314 2.6349716
## studytime   0.65628388 0.46510383 0.9099505


As shown above, all four variables are statistically significant. goout has the lowest p-value, suggesting a strong association between going out with friends and high/low alcohol consumption. The positive coefficient for the goout predictor suggests that, all other variables being equal, going out with friends increases alcohol consumption. In the logit model, the response variable is the log odds: \(ln(odds) = ln(p/(1-p)) = \alpha + \beta_1 x_1 + \beta_2 x_2 + … + \beta_n x_n\). A unit increase in going out with friends increases the log odds by 0.72677. The variables absences and sex also have low p-values, 0.000499 and 0.003242, respectively. The positive coefficient for absences suggests that, all other variables being equal, a unit increase in absences increases alcohol consumption: a unit increase in absences increases the log odds by 0.07811. sex is significant as well (p-value 0.003242), suggesting an association between the sex of the student and high/low alcohol consumption; its positive coefficient suggests that, all other variables being equal, male students are more likely to be high consumers, with being male increasing the log odds by 0.78173. studytime is also significant, and its negative coefficient suggests that, all other variables being equal, a unit increase in studytime reduces alcohol consumption: a unit increase in studytime reduces the log odds by 0.42116.

The odds are defined as the ratio of expected “successes” to “failures”: p/(1-p). Odds are an alternative way of expressing probabilities; when OR > 1, higher odds correspond to a higher probability of success. The exponents of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (vs. no change) in the corresponding explanatory variable. The exponent of the goout coefficient is 2.06838526: for a unit change in goout, all other variables being equal, the odds ratio is about 2.07. Similarly, the exponent of the sexM coefficient is 2.18525582, the exponent of the absences coefficient is 1.08123884, and the exponent of the studytime coefficient is 0.65628388. Hence the odds ratios for the significant goout, sex and absences variables are 2.06838526, 2.18525582 and 1.08123884 respectively, meaning that goout, sex and absences increase the odds of high alcohol consumption by factors of around 2.07, 2.19 and 1.08. The odds ratio for studytime is 0.65628388, which suggests that studytime decreases the odds of high alcohol consumption.
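As a quick arithmetic check of the interpretation above, exponentiating the log-odds coefficients (values copied from the model summary above) reproduces the odds ratios:

```r
# Odds ratios are the exponentiated logistic-regression coefficients.
# Coefficient values copied from the fitted model summary above.
b <- c(goout = 0.72676824, sexM = 0.78173290,
       absences = 0.07810746, studytime = -0.42116184)
round(exp(b), 2)  # goout 2.07, sexM 2.19, absences 1.08, studytime 0.66
```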

Predictive power of the model
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'modified_data'

modified_data <- mutate(modified_data, probability = probabilities)

# use the probabilities to make a prediction of high_use

modified_data <- mutate(modified_data, prediction = probability>0.5)

table(high_use = modified_data$high_use, prediction = modified_data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE     65   49
# the logistic regression model m and the data (with predictions) are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(modified_data$high_use, modified_data$probability)
## [1] 0.2120419

The above 2x2 cross-tabulation indicates that out of 382 observations, 81 (65 + 16) were wrongly predicted. The average proportion of incorrect predictions in the data is about 21%. My model's 21% error rate is better than that of the model in the DataCamp exercises.
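As a sanity benchmark, a trivial classifier that always predicts FALSE would be wrong exactly for the high-use students. The counts below are taken from the cross-tabulation above (268 FALSE, 114 TRUE, 382 in total).

```r
# Error rate of an always-FALSE classifier, from the counts in the table above.
trivial_error <- 114 / 382
round(trivial_error, 3)  # 0.298, clearly worse than the model's 0.212
```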


Bonus

# K-fold cross-validation

cv <- cv.glm(data = modified_data, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2094241

The 10-fold cross-validation gives a prediction error of 0.2094241. This is not much of an improvement: there is no clearly smaller prediction error than the training error obtained from the 2x2 cross-tabulation above (0.2120419).


Clustering and classification

#install.packages("corrplot")  # install the corrplot package if needed
#install.packages("kableExtra")
library(GGally)
library(ggplot2)
library(tidyr)
library(dplyr)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.84 loaded
library(ggpubr)
library(magrittr)
library(boot)
library(knitr)
library(kableExtra)
#library("DT")
# load the data
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The default installation of R comes with many (usually small) data sets. The Boston data set we are dealing with in this week's exercise comes from the MASS package. The data were originally published by Harrison et al. (1978), containing information on Boston house prices, and were later republished by Belsley et al. (1980). The Boston data set has 506 observations and 14 variables. Details about the data set can be found at these two links: [1] and [2].


Boston summary table

Boston_summary= do.call(cbind, lapply(Boston, summary))

kable(Boston_summary,"html", caption="Boston summary table") %>%   kable_styling(bootstrap_options ="condensed",  font_size = 13, full_width = F, position = "left") %>% column_spec(1:1, bold = T,color = "white", background = "green") %>% row_spec(0, bold = T,  color = "white", background = "green")
Boston summary table
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
Min. 0.006320 0.00000 0.46000 0.00000 0.3850000 3.561000 2.9000 1.129600 1.000000 187.0000 12.60000 0.3200 1.73000 5.00000
1st Qu. 0.082045 0.00000 5.19000 0.00000 0.4490000 5.885500 45.0250 2.100175 4.000000 279.0000 17.40000 375.3775 6.95000 17.02500
Median 0.256510 0.00000 9.69000 0.00000 0.5380000 6.208500 77.5000 3.207450 5.000000 330.0000 19.05000 391.4400 11.36000 21.20000
Mean 3.613524 11.36364 11.13678 0.06917 0.5546951 6.284634 68.5749 3.795043 9.549407 408.2372 18.45553 356.6740 12.65306 22.53281
3rd Qu. 3.677083 12.50000 18.10000 0.00000 0.6240000 6.623500 94.0750 5.188425 24.000000 666.0000 20.20000 396.2250 16.95500 25.00000
Max. 88.976200 100.00000 27.74000 1.00000 0.8710000 8.780000 100.0000 12.126500 24.000000 711.0000 22.00000 396.9000 37.97000 50.00000

Visualizing the scatter plot matrix and examining the correlation between the Boston data set variables


p1=ggpairs(Boston,title="scatter plot matrix with correlation") 
p1 + theme(plot.title = element_text(size = rel(2)))


cor_matrix<-cor(Boston) %>% round(digits=2)

# visualize the correlation matrix

corrplot(cor_matrix, method="circle", type = "lower", cl.pos = "b", tl.col="black", tl.pos = "d", tl.cex = 1.2,title="Correlations plot",mar=c(0,0,1,0))


The scatter plot matrix and correlation plot above describe the distributions of the variables and the relationships between them. In the scatter plot matrix, the diagonal cells show histograms of the variables, the lower panel displays a scatter plot for each pair of variables, and the upper panel displays the correlation value of each pair. The correlation plot is a coloured representation of the Boston correlation matrix, where the values are represented by different colours ranging from red to blue (-1 to 1), with white as the middle value, zero. For example, in the correlation plot we can see that the relation between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax) is represented by a high-intensity blue circle, and the scatter plot matrix displays a correlation value of 0.91: as the index of accessibility to radial highways increases, the full-value property-tax rate also increases, and vice versa. The correlation plot also displays a high-intensity red circle for nitrogen oxide concentration (parts per 10 million) (nox) and the weighted mean of distances to five Boston employment centres (dis), with a correlation value of -0.77 in the scatter plot matrix: as nox increases, dis decreases, and vice versa.

boston_scaled <- scale(Boston)


boston_scaled<-as.data.frame(boston_scaled)


Boston scaled summary table

Boston_summary_scaled= do.call(cbind, lapply(boston_scaled, summary))

kable(Boston_summary_scaled,"html") %>%   kable_styling(bootstrap_options ="condensed",  font_size = 13, full_width = F, position = "left") %>% column_spec(1:1, bold = T,color = "white", background = "green") %>% row_spec(0, bold = T,  color = "white", background = "green")
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
Min. -0.4193669 -0.4872402 -1.5563017 -0.2723291 -1.4644327 -3.8764132 -2.3331282 -1.2658165 -0.9818712 -1.3126910 -2.7047025 -3.9033305 -1.5296134 -1.9063399
1st Qu. -0.4105633 -0.4872402 -0.8668328 -0.2723291 -0.9121262 -0.5680681 -0.8366200 -0.8048913 -0.6373311 -0.7668172 -0.4875567 0.2048688 -0.7986296 -0.5988631
Median -0.3902803 -0.4872402 -0.2108898 -0.2723291 -0.1440749 -0.1083583 0.3170678 -0.2790473 -0.5224844 -0.4642132 0.2745872 0.3808097 -0.1810744 -0.1449159
Mean 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
3rd Qu. 0.0073892 0.0487240 1.0149946 -0.2723291 0.5980871 0.4822906 0.9059016 0.6617161 1.6596029 1.5294129 0.8057784 0.4332223 0.6024226 0.2682577
Max. 9.9241096 3.8004735 2.4201701 3.6647712 2.7296452 3.5515296 1.1163897 3.9566022 1.6596029 1.7964164 1.6372081 0.4406159 3.5452624 2.9865046


One of the main goals of standardizing the data is to allow comparison between different variables and data sets. The Boston data set is scaled using the R function scale(). From the summary table of the scaled data above, we can clearly see that each variable now has a mean of 0.
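A minimal check that scale() implements the standardization formula column-wise, which is why every scaled variable has mean 0:

```r
# scale() subtracts the column mean and divides by the column standard deviation.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
stopifnot(isTRUE(all.equal(as.numeric(scale(x)), manual)))
mean(scale(x))  # 0: standardized variables are centred
</imports>
```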

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
pairs(boston_scaled[1:13], main = "Scaled scatter plot matrix",
      pch = 21, bg = c("red", "green3", "blue","yellow")[boston_scaled$crime],
      oma=c(4,4,6,12),upper.panel = NULL)
par(xpd=TRUE)
legend(0.85, 0.7, as.vector(unique(boston_scaled$crime)),  
       fill=c("red", "green3", "blue","yellow"))


In scaling, each Boston variable has its mean subtracted and is divided by its standard deviation, i.e. every variable is standardized using the formula \(\frac{X-\mu}{\sigma}\), so every mean becomes 0. The variable values change, and we can clearly see a big difference between the scaled and the original Boston data sets: comparing the two summary tables, the maximum, minimum and quartile values are not the same. However, the correlation values are unchanged in both data sets, since correlation is invariant to linear scaling.

n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]

test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# fit the linear discriminant analysis on the train data
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2475248 0.2351485 0.2599010 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.93842460 -0.8831472 -0.120902136 -0.8809427  0.4188738
## med_low  -0.06897372 -0.3072406  0.003267949 -0.5750552 -0.1519524
## med_high -0.37711365  0.1960403  0.224988857  0.3490041  0.1927490
## high     -0.48724019  1.0170492 -0.047351911  1.0360695 -0.3817375
##                 age        dis        rad        tax     ptratio
## low      -0.8653478  0.9087135 -0.6892330 -0.7413149 -0.41516187
## med_low  -0.3354630  0.3909348 -0.5374145 -0.4697906 -0.03997033
## med_high  0.4028898 -0.3617635 -0.4148912 -0.3136295 -0.30425315
## high      0.8169950 -0.8672730  1.6388211  1.5145512  0.78158339
##                black       lstat        medv
## low       0.37684956 -0.75327976  0.48258098
## med_low   0.31297673 -0.10650563 -0.02661776
## med_high  0.04157575 -0.06587699  0.24021680
## high     -0.85618161  0.90698582 -0.68701197
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.06095553  0.64584169 -0.9626994
## indus    0.06304883 -0.27200735  0.1030758
## chas    -0.08992486 -0.03925346  0.1308242
## nox      0.34409825 -0.82599630 -1.2330147
## rm      -0.10050206 -0.07707864 -0.1828312
## age      0.23700423 -0.30949601 -0.1307500
## dis     -0.03675350 -0.25581011  0.1508864
## rad      3.37670907  0.89172096 -0.2146238
## tax      0.06061471  0.07581249  0.6146298
## ptratio  0.09863067  0.02020399 -0.1700169
## black   -0.10072861  0.03094063  0.1136943
## lstat    0.30535854 -0.17074967  0.4629845
## medv     0.23023425 -0.38915489 -0.1612996
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9580 0.0311 0.0109


The prior probabilities above show the proportion of each group (“low, med_low, med_high, high”) in the training observations; each group makes up approximately 25% of the total. The group means display the mean value of each variable in each group, showing how the means differ between groups. The coefficients of the linear discriminants are the coefficients of each variable in the linear combinations that form the discriminants. For example, the first discriminant function is the linear combination \(0.0610 \cdot zn + 0.0630 \cdot indus - 0.0899 \cdot chas + 0.3441 \cdot nox + \ldots + 0.3054 \cdot lstat + 0.2302 \cdot medv\). The proportion of trace gives the share of between-group variance explained; in my analysis the first discriminant explains more than 95% of the between-group variance in the Boston data set.


lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes)
lda.arrows(lda.fit, myscale = 1.4)


In the above biplot, the index of accessibility to radial highways (rad) has by far the longest arrow, i.e. the largest coefficient on the first discriminant, and thus accounts for much of the separation between the crime classes.

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      11        0    0
##   med_low    6      14        6    0
##   med_high   0      15       14    2
##   high       0       0        0   22

Every time I run the code the cross-tabulation values differ, due to the random sampling of the train and test data. In the cross-tabulation above, however, my model correctly predicted 62 of the 102 test observations, i.e. about 61%, giving a misclassification rate of roughly 39%.
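The accuracy can be computed directly from the cross-tabulation. The counts below are copied from the table above (they come from one particular random split and change on every run).

```r
# Accuracy = correct predictions (diagonal) / all predictions.
# Counts copied from the 4x4 cross-tabulation above.
tab <- matrix(c(12, 11,  0,  0,
                 6, 14,  6,  0,
                 0, 15, 14,  2,
                 0,  0,  0, 22),
              nrow = 4, byrow = TRUE)
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 2)  # 0.61, i.e. a misclassification rate of about 0.39
```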

data('Boston')

boston_scaled2 <- scale(Boston)

dist_eu <- dist(boston_scaled2)

set.seed(12345)

k_max <- 13

twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

#qplot(x = 1:k_max, y = twcss, geom =c("point", "line"),span = 0.2)

# k-means clustering
b=x = 1:k_max
aa=data.frame(cbind(b,twcss))

ggplot(data=aa, aes(x=b, y=twcss, group=1)) +
  geom_line(color="red")+
  geom_point() + ggtitle("Optimal Number of Clusters")


One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. With the elbow method, each successive increase in the number of clusters brings a decrease in the WCSS, and the optimal number of clusters is found where the WCSS stops dropping radically. In the line plot above, the WCSS falls sharply up to two clusters, after which the decrease in within-cluster dissimilarity is no longer substantial. Consequently, we can say with reasonable confidence that the optimal number of clusters is 2.
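The "radical drop" can also be quantified as the relative decrease in WCSS for each added cluster, which makes the elbow easier to read off than from the plot alone. Since the `twcss` values for Boston are not reproduced here, this sketch uses synthetic data with two obvious clusters; the seed and `nstart` value are illustrative choices:

```r
set.seed(12345)
# Two well-separated synthetic clusters (a stand-in for the Boston run)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

k_max <- 6
twcss <- sapply(1:k_max, function(k) kmeans(x, k, nstart = 10)$tot.withinss)

# Relative drop in WCSS for each added cluster: the elbow is where
# the drop stops being substantial (here, after k = 2)
rel_drop <- -diff(twcss) / head(twcss, -1)
round(rel_drop, 2)
```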


Taking two as the optimal number of clusters, the k-means algorithm is run again and the results are plotted below:


km <-kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2, main = "K-means clustering",col = km$cluster, upper.panel = NULL)
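To characterise the two clusters beyond the pairs plot, it may help to compare the standardised variable means within each cluster: positive values are above the overall average, negative values below. The sketch below repeats the scaling and clustering steps from the chunks above so that it runs on its own; the reused seed and the selected columns are illustrative choices:

```r
library(MASS)   # for the Boston data

boston_scaled2 <- scale(Boston)
set.seed(12345)
km <- kmeans(boston_scaled2, centers = 2)

# mean of each standardised variable within each cluster
cluster_means <- aggregate(as.data.frame(boston_scaled2),
                           by = list(cluster = km$cluster), FUN = mean)
round(cluster_means[, c("cluster", "crim", "rad", "tax", "nox")], 2)
```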

Bonus

boston_scaled_bonus <- as.data.frame(scale(Boston))
set.seed(12345)
# k-means with 4 clusters on the Euclidean distance matrix
km_bonus <- kmeans(dist_eu, centers = 4)

# add the cluster assignments as the target variable and fit LDA
boston_scaled_bonus$clust <- km_bonus$cluster
lda.fit_bonus <- lda(clust ~ ., data = boston_scaled_bonus)
lda.fit_bonus
## Call:
## lda(clust ~ ., data = boston_scaled_bonus)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2114625 0.4308300 0.2272727 0.1304348 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm
## 1 -0.3912182  1.2671159 -0.8754697  0.5739635 -0.7359091  0.9938426
## 2 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 3  0.2797949 -0.4872402  1.1892663 -0.2723291  0.8998296 -0.2770011
## 4  1.4330759 -0.4872402  1.0689719  0.4435073  1.3439101 -0.7461469
##          age        dis        rad        tax     ptratio       black
## 1 -0.6949417  0.7751031 -0.5965444 -0.6369476 -0.96586616  0.34190729
## 2 -0.3256000  0.3182404 -0.5741127 -0.6240070  0.02986213  0.34248644
## 3  0.7716696 -0.7723199  0.9006160  1.0311612  0.60093343 -0.01717546
## 4  0.8575386 -0.9620552  1.2941816  1.2970210  0.42015742 -1.65562038
##        lstat        medv
## 1 -0.8200275  1.11919598
## 2 -0.2813666 -0.01314324
## 3  0.6116223 -0.54636549
## 4  1.1930953 -0.81904111
## 
## Coefficients of linear discriminants:
##                 LD1        LD2         LD3
## crim     0.18113078  0.5012256  0.60535205
## zn       0.43297497  1.0486194 -0.67406151
## indus    1.37753200 -0.3016928 -1.07034034
## chas    -0.04307937  0.7598229  0.22448239
## nox      1.04674638  0.3861005  0.33268952
## rm      -0.14912869  0.1510367 -0.67942589
## age     -0.09897424 -0.0523110 -0.26285587
## dis      0.13139210  0.1593367  0.03487882
## rad      0.65824136 -0.5189795 -0.48145070
## tax      0.28903561  0.5773959 -0.10350513
## ptratio  0.22236843 -0.1668597  0.09181715
## black   -0.42730704 -0.5843973 -0.89869354
## lstat    0.24320629  0.6197780  0.01119242
## medv     0.21961575  0.9485829  0.17065360
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636


# plot the lda results
plot(lda.fit_bonus, dimen = 2, col = boston_scaled_bonus$clust)
lda.arrows(lda.fit_bonus, myscale = 3)

In the above biplot, the arrow for the proportion of non-retail business acres per town (indus) is the longest, meaning that indus is the most influential variable in separating the four k-means clusters along the first discriminant (that is, it accounts for as much of the between-cluster variability in the Boston data as possible).
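The dominance of indus can be checked numerically by ranking the absolute LD1 coefficients: the variable with the largest magnitude contributes most to the first discriminant and therefore draws the longest biplot arrow. This sketch repeats the bonus pipeline from the chunks above so that it is self-contained; in the output printed above, indus tops the list (about 1.38), followed by nox (about 1.05), though the exact ordering may vary between runs because the k-means initialisation is random:

```r
library(MASS)

boston_scaled_bonus <- as.data.frame(scale(Boston))
set.seed(12345)                                  # same seed as above
km_bonus <- kmeans(dist(boston_scaled_bonus), centers = 4)
boston_scaled_bonus$clust <- km_bonus$cluster

lda.fit_bonus <- lda(clust ~ ., data = boston_scaled_bonus)

# absolute LD1 coefficients, largest first
sort(abs(coef(lda.fit_bonus)[, "LD1"]), decreasing = TRUE)
```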